Extension Architecture And Components
This document explains the browser extension architecture and component interactions for the Open DIA project. It focuses on:
The content script’s role in executing browser actions
The background script’s coordination for cross-tab communication
The side panel UI integration and agent orchestration
Message passing protocols between extension components and the main application
The AgentExecutor component’s role in coordinating agent actions and the content script’s execution environment
Examples of lifecycle management, permission handling, and security boundaries
Cross-browser compatibility considerations and extension manifest configuration
The extension is organized into entrypoints for background, content, and side panel UI, plus shared utilities for agent orchestration and messaging.
Background script: Central coordinator for cross-tab communication, tab state, and action dispatch. Handles message routing and executes browser-level commands.
Content script: Runs in-page to manipulate DOM and respond to action requests scoped to the active tab.
Side panel UI: React-based interface that orchestrates agent execution, manages sessions, and coordinates with background and content scripts.
Agent utilities: Parse slash commands, map agents/actions to endpoints, and execute agent requests with contextual page data.
WebSocket client: Provides a minimal client for real-time agent execution and progress updates.
The extension follows a message-passing architecture:
Side panel initiates agent execution and sends commands to the background script.
Background script resolves actions, injects content scripts when needed, and coordinates tab-level operations.
Content script performs DOM-level actions within the active tab.
Utilities parse commands, construct payloads, and capture page context for agent execution.
Background Script Coordination
Responsibilities:
Listens for messages from side panel and content script
Coordinates tab state and cross-tab communication
Executes browser-level actions (tabs, navigation, scripting injection)
Routes agent tool execution to handlers
Key message types:
ACTIVATE_AI_FRAME / DEACTIVATE_AI_FRAME: Manage overlay frames per tab
GET_ACTIVE_TAB / GET_ALL_TABS: Tab discovery and state
EXECUTE_ACTION: Dispatch actions to content script
GEMINI_REQUEST: Issue a Gemini LLM request from within the extension, using the Google Generative AI SDK
RUN_GENERATED_AGENT: Execute generated action plans
EXECUTE_AGENT_TOOL: Invoke agent tools with structured payloads
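The routing described above can be sketched as a lookup table keyed by message type. This is a minimal illustration, not the project's actual background script: the message type names come from the list above, while the `ExtensionMessage` envelope, the `routeMessage` function, and the handler bodies are hypothetical placeholders.

```typescript
// Message type names come from the list above; everything else is illustrative.
const MESSAGE_TYPES = [
  "ACTIVATE_AI_FRAME",
  "DEACTIVATE_AI_FRAME",
  "GET_ACTIVE_TAB",
  "GET_ALL_TABS",
  "EXECUTE_ACTION",
  "GEMINI_REQUEST",
  "RUN_GENERATED_AGENT",
  "EXECUTE_AGENT_TOOL",
] as const;

type MessageType = (typeof MESSAGE_TYPES)[number];

// Hypothetical envelope: the real payload shapes are not documented here.
interface ExtensionMessage {
  type: string;
  payload?: unknown;
}

type Handler = (payload: unknown) => string;

// One placeholder handler per known type. Record<MessageType, Handler>
// turns a missing entry into a compile-time error.
const handlers: Record<MessageType, Handler> = {
  ACTIVATE_AI_FRAME: () => "activate overlay frame",
  DEACTIVATE_AI_FRAME: () => "deactivate overlay frame",
  GET_ACTIVE_TAB: () => "return active tab",
  GET_ALL_TABS: () => "return all tabs",
  EXECUTE_ACTION: () => "forward action to content script",
  GEMINI_REQUEST: () => "send Gemini request",
  RUN_GENERATED_AGENT: () => "execute action plan",
  EXECUTE_AGENT_TOOL: () => "execute agent tool",
};

// Narrow an arbitrary string to a known message type.
function isKnownType(type: string): type is MessageType {
  return (MESSAGE_TYPES as readonly string[]).includes(type);
}

// Dispatch to the matching handler, or report an unknown type.
function routeMessage(message: ExtensionMessage): string {
  if (!isKnownType(message.type)) return "unknown type";
  return handlers[message.type](message.payload);
}
```

In a real background script, a function like this would typically be called from a `chrome.runtime.onMessage` listener, with each handler performing its browser-level work before a response is sent.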
[Flowchart: the background script branches on the message type (ACTIVATE_AI_FRAME, DEACTIVATE_AI_FRAME, GET_ACTIVE_TAB, GET_ALL_TABS, EXECUTE_ACTION, GEMINI_REQUEST, RUN_GENERATED_AGENT, EXECUTE_AGENT_TOOL, or an unknown type); every branch ends with a response being sent.]
Content Script Execution Environment
Role:
Runs in-page to perform DOM-level actions
Responds to action requests from background script
Provides simple page interaction helpers (play/pause video, click, fill, scroll, info)
Current capabilities:
Keyword-based action parsing for simple commands
DOM queries and synthetic events for input and click
Basic page information extraction
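Keyword-based action parsing of this kind might look like the sketch below. The action categories (video, click, type/fill, scroll, info) follow the helpers listed above; the `PageAction` type, the `parseAction` function, the exact keywords, and the 500-pixel scroll step are assumptions made for illustration.

```typescript
// Action categories follow the helpers listed above; field names are hypothetical.
type PageAction =
  | { kind: "video"; play: boolean }
  | { kind: "click"; target: string }
  | { kind: "type"; text: string }
  | { kind: "scroll"; amount: number }
  | { kind: "info" }
  | { kind: "unknown" };

// Map a free-text command to an action by looking for simple keywords.
function parseAction(command: string): PageAction {
  const text = command.trim();
  const lower = text.toLowerCase();

  // Commands with an explicit verb prefix are checked first so that
  // "click play" is treated as a click, not as a video command.
  const click = text.match(/^click\s+(.+)$/i);
  if (click) return { kind: "click", target: click[1] };

  const typed = text.match(/^(?:type|fill)\s+(.+)$/i);
  if (typed) return { kind: "type", text: typed[1] };

  if (/\b(play|pause)\b/.test(lower)) {
    return { kind: "video", play: !/\bpause\b/.test(lower) };
  }
  if (/\bscroll\b/.test(lower)) {
    // Assumed step size: 500 px, negative when scrolling up.
    return { kind: "scroll", amount: /\bup\b/.test(lower) ? -500 : 500 };
  }
  if (/\binfo\b/.test(lower)) return { kind: "info" };

  return { kind: "unknown" };
}
```

Keeping the parser a pure function separates the keyword matching from the DOM work, so the matching can be tested without a page.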
[Flowchart: the content script branches on the requested action. Video: trigger play/pause. Click: find the matching button and dispatch a click. Type/Fill: find an input or textarea, set its value, and dispatch events. Scroll: scroll the window by an amount. Info: collect page metadata. Every branch returns a result.]
Side Panel UI Integration and AgentExecutor
Responsibilities:
Manages sessions, chat history, and UI state
Parses slash commands and maps to agent/action endpoints
Executes agent requests and triggers browser actions
Integrates with WebSocket client for real-time execution
Key flows:
Slash command parsing routes to appropriate agent endpoints
Agent execution captures page context (HTML, URL, title) when needed
Action plan execution dispatches actions to background/content scripts
Settings and authentication screens integrate with browser storage
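Capturing page context (HTML, URL, title) for an agent request could be sketched as follows. The `PageContext` and `DocumentLike` types, the `capturePageContext` function, and the 50,000-character cap are assumptions for illustration; the source does not specify how the context is collected or bounded.

```typescript
// Hypothetical shape of the context sent along with an agent request.
interface PageContext {
  url: string;
  title: string;
  html: string;
  truncated: boolean;
}

// Minimal structural view of a DOM Document, so the sketch also runs outside a browser.
interface DocumentLike {
  URL: string;
  title: string;
  documentElement: { outerHTML: string };
}

// Collect URL, title, and HTML, capping the HTML to keep the payload small.
function capturePageContext(doc: DocumentLike, maxHtmlChars = 50_000): PageContext {
  const fullHtml = doc.documentElement.outerHTML;
  return {
    url: doc.URL,
    title: doc.title,
    html: fullHtml.slice(0, maxHtmlChars),
    truncated: fullHtml.length > maxHtmlChars,
  };
}
```

In a browser context the global `document` has these three properties, so it could be passed directly. The cap is one way to avoid oversized payloads.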
Agent Utilities and Mapping
Command parsing: Supports agent selection, action selection, and completion stages
Endpoint mapping: Maps agent-action pairs to backend endpoints
Execution: Builds payloads with page context, chat history, and optional attachments
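The three parsing stages and the endpoint mapping can be illustrated with a small sketch. The command syntax (`/agent action prompt`), the `research`/`summarize` names, and the endpoint path are all hypothetical; only the staged parsing and the agent-action-to-endpoint lookup are taken from the description above.

```typescript
type ParseStage = "agent-selection" | "action-selection" | "complete";

interface ParsedCommand {
  stage: ParseStage;
  agent?: string;
  action?: string;
  prompt?: string;
}

// Hypothetical agent/action table; real names and endpoints are not given in the text.
const ENDPOINTS = new Map<string, Map<string, string>>([
  ["research", new Map([["summarize", "/api/research/summarize"]])],
]);

// Parse "/agent action prompt" and report how far the command has progressed.
function parseSlashCommand(input: string): ParsedCommand | null {
  if (!input.startsWith("/")) return null; // not a slash command
  const [agent = "", action = "", ...rest] = input.slice(1).trim().split(/\s+/);

  const actions = ENDPOINTS.get(agent);
  if (!actions) return { stage: "agent-selection", agent };
  if (!actions.has(action)) return { stage: "action-selection", agent, action };
  return { stage: "complete", agent, action, prompt: rest.join(" ") };
}

// Look up the backend endpoint for a fully parsed command.
function resolveEndpoint(command: ParsedCommand): string | undefined {
  if (command.stage !== "complete" || !command.agent || !command.action) return undefined;
  return ENDPOINTS.get(command.agent)?.get(command.action);
}
```

Returning the stage alongside the partial result lets a UI decide whether to show agent suggestions, action suggestions, or submit the request.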
WebSocket Client Integration
Provides a simple API for real-time agent execution and progress updates
Emits connection status, progress, and result/error events
Used by the side panel to enhance agent execution UX
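A thin wrapper that surfaces connection status, progress, and result/error events might look like this. It is a sketch under assumptions: the `SocketLike` interface is a minimal structural stand-in that a Socket.IO client socket should satisfy, `connect` and `disconnect` are standard Socket.IO client events, and the `agent:*` event names and the `attachAgentRun` function are hypothetical.

```typescript
// Minimal structural view of a socket, so the wrapper can be tested with a fake.
interface SocketLike {
  on(event: string, listener: (payload: any) => void): unknown;
  emit(event: string, payload?: unknown): unknown;
}

interface AgentRunCallbacks {
  onStatus?: (status: "connected" | "disconnected") => void;
  onProgress?: (update: unknown) => void;
  onResult?: (result: unknown) => void;
  onError?: (error: unknown) => void;
}

// Wire the callbacks to socket events, then start one agent run.
function attachAgentRun(socket: SocketLike, request: unknown, callbacks: AgentRunCallbacks): void {
  socket.on("connect", () => callbacks.onStatus?.("connected"));
  socket.on("disconnect", () => callbacks.onStatus?.("disconnected"));

  // Hypothetical application-level event names.
  socket.on("agent:progress", (update) => callbacks.onProgress?.(update));
  socket.on("agent:result", (result) => callbacks.onResult?.(result));
  socket.on("agent:error", (error) => callbacks.onError?.(error));

  socket.emit("agent:execute", request);
}
```

Depending only on `on` and `emit` keeps the UI code independent of the transport library and makes the event wiring easy to exercise in tests.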
External dependencies and their roles:
React ecosystem: UI rendering and state management
Socket.IO client: Real-time communication with agent server
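Google Generative AI SDK: Gemini LLM requests issued from within the extension (the SDK calls Google's hosted API rather than running a model on-device)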
Tailwind/KaTeX: UI styling and math rendering
Performance recommendations:
Minimize DOM queries and synthetic event dispatches; batch actions when possible
Use timeouts and listeners for tab operations to avoid blocking
Cache page context only when necessary; avoid large payloads
Debounce UI updates and progress reporting to reduce re-renders
Prefer browser APIs (tabs, scripting) over frequent polling
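The advice about timeouts for tab operations can be illustrated with a generic promise wrapper. This is a sketch, not code from the project: `withTimeout` is a hypothetical helper, and the operation it wraps would be whatever tab or navigation promise the caller already has.

```typescript
// Reject if the wrapped operation does not settle within `ms` milliseconds.
function withTimeout<T>(operation: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    operation.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the pending timeout
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

Note that the wrapper only stops waiting; it does not cancel the underlying operation, so callers may still need their own cleanup.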
Common issues and resolutions:
Action not executing in content script:
Ensure the content script is injected and the tab is active
Verify message routing from background to content script
Tab operations failing:
Confirm tab IDs and window context
Check for navigation completion before performing actions
WebSocket connectivity:
Validate server availability and CORS
Use fallback HTTP stats when WebSocket is disconnected
Permission errors:
Review manifest permissions and host permissions
Reload the extension after changing permissions; reinstall it if the new permissions are not picked up
The extension employs a clear separation of concerns: the background script coordinates cross-tab operations, the content script handles DOM-level actions, and the side panel orchestrates agent execution with real-time feedback. Utilities provide robust command parsing and context-aware agent execution. Permissions and manifest configuration enable broad site access and side panel integration. With careful attention to performance and error handling, the architecture supports scalable agent-driven browser automation.
Message Passing Protocols
Side panel to background:
Types: ACTIVATE_AI_FRAME, DEACTIVATE_AI_FRAME, GET_ACTIVE_TAB, GET_ALL_TABS, EXECUTE_ACTION, GEMINI_REQUEST, RUN_GENERATED_AGENT, EXECUTE_AGENT_TOOL
Background to content:
Types: PERFORM_ACTION (via tabs.sendMessage)
Background to side panel:
Responses to all requests with success/error payloads
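A uniform success/error response envelope could be modeled as below. The field names (`success`, `data`, `error`) and the `toReply` helper are assumptions; the text only states that every request receives a success or error payload.

```typescript
// Hypothetical reply envelope: either a success with data or a failure with a message.
type MessageReply<T> =
  | { success: true; data: T }
  | { success: false; error: string };

// Run a handler and convert its outcome (value or thrown error) into a reply.
async function toReply<T>(work: () => Promise<T> | T): Promise<MessageReply<T>> {
  try {
    return { success: true, data: await work() };
  } catch (err) {
    return {
      success: false,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Wrapping every handler this way means the sender can always branch on a single `success` flag instead of handling thrown errors and missing responses separately.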
Lifecycle Management and Security Boundaries
Lifecycle:
Background script initializes listeners and tab tracking
Side panel activates/deactivates AI frames and manages sessions
Content script loads per-page and responds to actions
Security:
Content scripts run in an isolated world: they share the page's DOM but not its JavaScript scope, and can use only a small subset of extension APIs
Background script bridges privileged APIs with page contexts
Manifest permissions define scope; host permissions grant broad access
Cross-Browser Compatibility
Build targets:
Chrome MV3 and Firefox via WXT build flags
Differences:
Some APIs differ between browsers; use feature detection
Manifest keys and permissions may vary slightly
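Feature detection for the panel API is one concrete case. Chrome exposes `chrome.sidePanel`, while Firefox provides `browser.sidebarAction`. The sketch below checks for either on whatever extension namespace is available. The function names are hypothetical, and how the project actually handles this difference is not stated in the text.

```typescript
type PanelApi = "sidePanel" | "sidebarAction" | "none";

// Prefer the promise-based `browser` namespace when present, else fall back to `chrome`.
function pickExtensionNamespace(globals: { browser?: object; chrome?: object }): object | undefined {
  return globals.browser ?? globals.chrome;
}

// Report which panel API the given namespace exposes, if any.
function detectPanelApi(api: object | undefined): PanelApi {
  if (!api) return "none";
  if ("sidePanel" in api) return "sidePanel";
  if ("sidebarAction" in api) return "sidebarAction";
  return "none";
}
```

In an extension context, the global object would be passed to `pickExtensionNamespace`. Checking for the property rather than the browser name keeps the code working if a browser later adds the API.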
Extension Manifest Configuration
Name, description, permissions, host permissions
Permissions include tabs, storage, scripting, identity, sidePanel, webNavigation, webRequest, cookies, bookmarks, history, clipboard, notifications, contextMenus, downloads
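As an illustration, the manifest fields could be expressed as a plain object, which in a WXT project would typically sit under the `manifest` key of the WXT config file. The description text and the `<all_urls>` host pattern are assumptions, the clipboard entry is left out because the text does not say which specific clipboard permission is declared, and `missingPermissions` is a hypothetical helper for reviewing the list.

```typescript
// Sketch of the manifest fields named above; values marked below are assumptions.
const manifest = {
  name: "Open DIA",
  description: "Placeholder description", // assumption: real text not given
  permissions: [
    "tabs",
    "storage",
    "scripting",
    "identity",
    "sidePanel",
    "webNavigation",
    "webRequest",
    "cookies",
    "bookmarks",
    "history",
    "notifications",
    "contextMenus",
    "downloads",
  ],
  host_permissions: ["<all_urls>"], // assumption: "broad access" pattern
};

// Return the permissions a feature needs that the manifest does not declare.
function missingPermissions(required: string[]): string[] {
  return required.filter((permission) => !manifest.permissions.includes(permission));
}
```

A check like `missingPermissions(["tabs", "scripting"])` can help when diagnosing permission errors: an empty array means every required permission is declared.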